
\[
\frac{\partial\, \mathrm{bool}(x)}{\partial x} =
\begin{cases}
1, & \text{if } |x| \leq 1 \\
0, & \text{otherwise.}
\end{cases}
\tag{5.29}
\]

By applying the bool(·) function, the elements of the attention weight with lower values are binarized to 0, so the obtained entropy-maximized attention weight retains only the crucial elements. The proposed Bi-Attention structure is finally expressed as

\[
\mathbf{B}_A = \mathrm{bool}(\mathbf{A}) = \mathrm{bool}\!\left(\frac{1}{\sqrt{D}}\, \mathbf{B}_Q \otimes \mathbf{B}_K^{\top}\right),
\tag{5.30}
\]
\[
\text{Bi-Attention}(\mathbf{B}_Q, \mathbf{B}_K, \mathbf{B}_V) = \mathbf{B}_A \boxtimes \mathbf{B}_V,
\tag{5.31}
\]

where BV is the binarized value obtained by sign(V), BA is the binarized attention weight, and ⊠ is a well-designed Bitwise-Affine Matrix Multiplication (BAMM) operator composed of ⊗ and bitshift to align training and inference representations and perform efficient bitwise calculation.

In a nutshell, in the Bi-Attention structure, the information entropy of the binarized attention weight is maximized (as Fig. 5.14(c) shows) to alleviate its severe information degradation and revive the attention mechanism. Bi-Attention also achieves greater efficiency since the softmax operation is excluded.
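To make this concrete, the following is a minimal PyTorch sketch of the bool(·) binarization with the straight-through gradient of Eq. (5.29) and of the Bi-Attention computation in Eqs. (5.30)–(5.31). It is an illustrative simplification rather than the authors' implementation: an ordinary floating-point matrix multiplication stands in for the BAMM operator, and the tensor shapes (batch, heads, tokens, head dimension D) are assumptions of this sketch.

```python
import torch


class BoolSTE(torch.autograd.Function):
    """bool(.) binarization to {0, 1}; the backward pass uses the
    straight-through estimator of Eq. (5.29), passing gradients
    only where |x| <= 1."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


def bi_attention(b_q, b_k, b_v, d):
    """Eqs. (5.30)-(5.31). b_q, b_k, b_v: binarized Q, K, V with
    shape (batch, heads, tokens, d). A plain matmul replaces the
    bitwise BAMM operator used for efficient inference."""
    a = b_q @ b_k.transpose(-1, -2) / d ** 0.5  # attention scores A
    b_a = BoolSTE.apply(a)                      # binarized attention weight B_A
    return b_a @ b_v                            # Bi-Attention output
```

In the full model, B_Q, B_K, and B_V would themselves be produced by sign(·) binarizers with their own straight-through estimators.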

5.9.2 Direction-Matching Distillation

As an optimization technique based on element-level comparison of activations, distillation allows the binarized BERT to mimic the intermediate activations of the full-precision teacher model. However, distillation causes a direction mismatch during optimization in the fully binarized BERT baseline, leading to insufficient optimization and even harmful effects. To address the direction mismatch that occurs in the backward propagation of the fully binarized BERT baseline, the authors further proposed a Direction-Matching Distillation (DMD) scheme with apposite distilled activations and well-constructed similarity matrices to effectively utilize knowledge from the teacher and optimize the fully binarized BERT more accurately.

Their efforts first fall into reselecting the distilled activations for DMD: the upstream query Q and key K are distilled instead of the attention score, which utilizes their knowledge while alleviating the direction mismatch. Besides, the authors also distill the value V to further cover all the inputs of the MHA. Then, similarity pattern matrices are constructed for the distilled activations, which can be expressed as

\[
\mathbf{P}_Q = \frac{\mathbf{Q} \times \mathbf{Q}^{\top}}{\lVert \mathbf{Q} \times \mathbf{Q}^{\top} \rVert}, \quad
\mathbf{P}_K = \frac{\mathbf{K} \times \mathbf{K}^{\top}}{\lVert \mathbf{K} \times \mathbf{K}^{\top} \rVert}, \quad
\mathbf{P}_V = \frac{\mathbf{V} \times \mathbf{V}^{\top}}{\lVert \mathbf{V} \times \mathbf{V}^{\top} \rVert},
\tag{5.32}
\]

where $\lVert\cdot\rVert$ denotes $\ell_2$ normalization. The corresponding teacher matrices $\mathbf{P}_{Q_T}$, $\mathbf{P}_{K_T}$, $\mathbf{P}_{V_T}$ are constructed in the same way from the teacher's activations. The distillation loss is expressed as:

\[
\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{DMD}} + \mathcal{L}_{\mathrm{hid}} + \mathcal{L}_{\mathrm{pred}},
\tag{5.33}
\]
\[
\mathcal{L}_{\mathrm{DMD}} = \sum_{l \in [1, L]} \sum_{F \in \mathcal{F}_{\mathrm{DMD}}} \left\lVert \mathbf{P}_F^{\,l} - \mathbf{P}_{F_T}^{\,l} \right\rVert,
\tag{5.34}
\]

where L denotes the number of transformer layers and $\mathcal{F}_{\mathrm{DMD}} = \{Q, K, V\}$. The loss term $\mathcal{L}_{\mathrm{hid}}$ is constructed in the same $\ell_2$ normalization form.
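For illustration, the similarity pattern matrices of Eq. (5.32) and the layer-wise DMD loss of Eq. (5.34) can be sketched in PyTorch as follows. The shapes and container format are assumptions for this sketch (per-layer activations flattened to (tokens, features) and collected in dicts keyed by 'Q', 'K', 'V'), and the helper names are ours rather than from the BiBERT code.

```python
import torch


def similarity_pattern(x):
    """Eq. (5.32): P = (X X^T) / ||X X^T||, where ||.|| is the l2 norm.
    x: activation of shape (tokens, features)."""
    gram = x @ x.transpose(-1, -2)
    return gram / gram.norm(p=2)


def dmd_loss(student_layers, teacher_layers):
    """Eq. (5.34): sum over layers l and F in F_DMD = {Q, K, V} of
    ||P_F^l - P_{F_T}^l||, with matched student/teacher activations."""
    loss = torch.zeros(())
    for student, teacher in zip(student_layers, teacher_layers):
        for key in ("Q", "K", "V"):
            p_s = similarity_pattern(student[key])
            p_t = similarity_pattern(teacher[key].detach())  # teacher is not updated
            loss = loss + (p_s - p_t).norm(p=2)
    return loss
```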

The overall pipeline of BiBERT is shown in Fig. 5.15. The authors conducted experiments on the GLUE benchmark, binarizing various BERT-based pre-trained models. The results listed in Table 5.7 show that BiBERT surpasses BinaryBERT by a wide margin in average accuracy.